Multiple Sequence Comparison: A Peptide Matching Approach

Authors

  • Marie-France Sagot
  • Alain Viari
  • Henry Soldano
Abstract

In this paper we present a peptide matching approach to the multiple comparison of a set of protein sequences. The approach consists of looking for all the words that are common to q of these sequences, where q is a parameter. Words are compared by reference to an object called a model. In the case of proteins, a model is a product of subsets of the alphabet Σ of the amino acids. These subsets belong to a cover of Σ, that is, their union is all of Σ. A word is said to be an instance of a model if it belongs to the model. Further flexibility is introduced by allowing up to e errors in the comparison between a word and a model. These errors may be gaps or substitutions not allowed by the cover. In this case, a word is said to be an occurrence of a model if the Levenshtein distance between it and an instance of the model is less than or equal to e. This corresponds to what we call a Set-Levenshtein distance between the occurrences and the model itself. Two words are said to be similar if there is at least one model of which both are occurrences. In the special case where e = 0, the occurrences of a model are simply its instances. If a model M has occurrences in at least q of the sequences of the set, M is said to occur in the set. The algorithm presented here is an efficient and exact way of looking for all the models, of a fixed length k or of the greatest possible length kmax, that occur in a set of sequences. It is linear in the total length n of the sequences and proportional to (e + 2) · (2e + 1) · k · p · g, where k << n is a small value in all practical situations, p is the number of sets in the cover and g is related to the cover's degree of non-transitivity. Models are closely related to what is called a consensus in the biocomputing area, and covers are a good way of representing complex relationships between the amino acids.
Keywords: multiple comparison, cover, model, Levenshtein distance, Set-Levenshtein distance
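
The definitions in the abstract (instance, occurrence, Set-Levenshtein distance, quorum q) can be illustrated with a minimal Python sketch. This is a brute-force illustration, not the linear-time algorithm of the paper. The function names, the three amino-acid groups and the example sequences are hypothetical and chosen only for illustration; the groups are a fragment of a cover, not a full cover of Σ.

```python
from typing import FrozenSet, Sequence, Tuple

# A model is a product of subsets of the alphabet: one subset per position.
Model = Tuple[FrozenSet[str], ...]

def is_instance(word: str, model: Model) -> bool:
    """True if the word belongs to the model (same length, each letter in its subset)."""
    return len(word) == len(model) and all(c in s for c, s in zip(word, model))

def set_levenshtein(word: str, model: Model) -> int:
    """Smallest Levenshtein distance between the word and any instance of the model.

    Standard edit-distance dynamic programme, except that aligning a letter
    with a model position costs 0 when the letter lies in that position's subset.
    Assumes every subset of the model is non-empty.
    """
    prev = list(range(len(model) + 1))  # distances from the empty word
    for i, c in enumerate(word, start=1):
        curr = [i]
        for j, subset in enumerate(model, start=1):
            cost = 0 if c in subset else 1
            curr.append(min(prev[j] + 1,          # letter c is extra in the word
                            curr[j - 1] + 1,      # model position j has no letter
                            prev[j - 1] + cost))  # align c with position j
        prev = curr
    return prev[-1]

def has_occurrence(seq: str, model: Model, e: int) -> bool:
    """True if some non-empty substring of seq is within distance e of the model."""
    k = len(model)
    for length in range(max(1, k - e), k + e + 1):
        for start in range(len(seq) - length + 1):
            if set_levenshtein(seq[start:start + length], model) <= e:
                return True
    return False

def occurs_in_set(sequences: Sequence[str], model: Model, e: int, q: int) -> bool:
    """True if the model has occurrences in at least q of the sequences."""
    return sum(has_occurrence(s, model, e) for s in sequences) >= q

# Hypothetical groups (assumption, not taken from the paper).
HYDROPHOBIC = frozenset("ILVM")
BASIC = frozenset("KRH")
ACIDIC = frozenset("DE")
model: Model = (HYDROPHOBIC, BASIC, ACIDIC)

sequences = ["GGLKEGG", "TTVRDTT", "CCCCCC"]
print(occurs_in_set(sequences, model, e=0, q=2))
```

With e = 0 the check reduces to looking for instances, as the abstract notes; raising e lets words such as "LKQE" (one gap) or "AKE" (one substitution outside the cover) count as occurrences of the same model.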

Similar resources

Comparative Study on Text Pattern Matching for Heterogeneous System

Shikha Pandey, Asst. Professor (CSE), Rungta College of Engineering & Technology, Bhilai, Chhattisgarh, India. Abstract: Pattern matching has been routinely used in various computer applications, for example in editors, in the retrieval of textual, image, or sound information, and in searching for nucleotide or amino acid sequence patterns in genome and protein sequence databases...

Using Multiple-Variable Matching to Identify EFL Ecological Sources of Differential Item Functioning

Context is a vague notion with numerous building blocks, which makes inferences from language test scores quite convoluted. This study uses a model of item responding that seeks to theorize the contextual infrastructure of differential item functioning (DIF) research and to help specify the sources of DIF. Two steps were taken in this research: first, to identify DIF by gender grouping via l...

Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence

MOTIVATION: Transitive sequence matching expands the scope of sequence comparison by re-running the results of a given query against the databank as a new query. This sometimes results in the initial query sequence (Q) being related to a final match (M) indirectly, through a third, 'intermediate' sequence (Q --> I --> M). This approach has often been suggested as providing greater sensitivity i...

PepTiger: Search Engine for Error-Tolerant Protein Identification from de Novo Sequences

In recent years a number of de novo sequencing software products have become available, providing possible partial or complete amino acid sequence tags for MS/MS spectra of peptides. However, for a variety of reasons, including spectral chemical noise and imperfect fragmentation, these sequence tags almost always contain errors. Additional difficulties arise from actual protein sequence variation and p...

EMAGEN: An Efficient Approach to Multiple Whole Genome Alignment

Following advances in biotechnology, many new whole genome sequences are becoming available every year. A lot of useful information can be derived from the alignment and comparison of different genomes. However, most of the current research focuses on pairwise genome alignment, and only a few available applications can efficiently align multiple genomes. In this paper, we present an efficient a...

Journal:
  • Theor. Comput. Sci.

Volume 180  Issue 

Pages  -

Publication date 1995